Consistent Subset Sampling
Authors
Abstract
Consistent sampling is a technique for specifying, in small space, a subset S of a potentially large universe U such that the elements in S satisfy a suitably chosen sampling condition. Given a subset I ⊆ U, it should be possible to quickly compute I ∩ S, i.e., the elements in I satisfying the sampling condition. Consistent sampling has important applications in similarity estimation and in estimating the number of distinct items in a data stream. In this paper we generalize consistent sampling to the setting where we are interested in sampling size-k subsets occurring in some set in a collection of sets of bounded size b, where k is a small integer. This can be done by applying standard consistent sampling to the k-subsets of each set, but that approach requires time Θ(b^k). Using a carefully designed hash function, for a given sampling probability p ∈ (0, 1], we show how to improve the time complexity to Θ(b^⌈k/2⌉ log log b + p b^k) in expectation, while maintaining strong concentration bounds for the sample. The space usage of our method is Θ(b^⌈k/4⌉). We demonstrate the utility of our technique by applying it to several well-studied data mining problems. We show how to efficiently estimate the number of frequent k-itemsets in a stream of transactions and the number of bipartite cliques in a graph given as an incidence stream. Further, building upon recent work by Campagna et al., we show that our approach can be applied to frequent itemset mining in a parallel or distributed setting. We also present applications in graph stream mining.
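As an illustration of the baseline the abstract describes, the following minimal Python sketch implements standard consistent sampling with a seeded hash, and then applies it naively to every k-subset of a set. It is not the paper's improved hash construction; it only shows the straightforward approach that enumerates all k-subsets. The function names (`hash64`, `in_sample`, `sample_elements`, `sample_k_subsets_naive`), the choice of BLAKE2b as the hash, and the `seed` parameter are assumptions made for this example.

```python
import hashlib
from itertools import combinations


def hash64(item, seed=0):
    """Deterministically map an item to a 64-bit integer (hypothetical helper)."""
    data = f"{seed}:{item!r}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def in_sample(item, p, seed=0):
    """Sampling condition: the item is in S iff its hash falls below p * 2^64."""
    return hash64(item, seed) < p * 2**64


def sample_elements(items, p, seed=0):
    """Compute I ∩ S for a given subset I of the universe."""
    return {x for x in items if in_sample(x, p, seed)}


def sample_k_subsets_naive(items, k, p, seed=0):
    """Baseline: test every k-subset of one set, enumerating all of them."""
    # Sorting gives each k-subset one canonical tuple, so the same subset
    # hashes identically no matter which input set it came from.
    return {c for c in combinations(sorted(items), k) if in_sample(c, p, seed)}
```

Because the sampling decision depends only on the item and the seed, two parties holding different sets agree on which shared elements (or shared k-subsets) are sampled without communicating, which is the property that makes the sample "consistent".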
Similar resources
OPTIMIZATION OF SKELETAL STRUCTURES USING IMPROVED GENETIC ALGORITHM BASED ON PROPOSED SAMPLING SEARCH SPACE IDEA
This article attempts to speed up genetic algorithm (GA) optimization by partitioning the design space. To this end, the design space is searched in two steps: a global search and a local search. Following the meshing used in FEM, the list of sections is first divided into specific subsets. Then the intermediate member of each subset, as the representative of that subset, is defined...
Phase-space overlap measures. II. Design and implementation of staging methods for free-energy calculations.
We consider staged free-energy calculation methods in the context of phase-space overlap relations, and argue that the selection of work-based methods should be guided by consideration of the phase-space overlap of the systems of interest. Stages should always be constructed such that work is performed only into a system that has a phase-space subset relation with the starting system. Thus mult...
On the Variance of Subset Sum Estimation
For high volume data streams and large data warehouses, sampling is used for efficient approximate answers to aggregate queries over selected subsets. Mathematically, we are dealing with a set of weighted items and want to support queries to arbitrary subset sums. With unit weights, we can compute subset sizes which together with the previous sums provide the subset averages. The question addre...
Range-Efficient Consistent Sampling and Locality-Sensitive Hashing for Polygons
Locality-sensitive hashing (LSH) is a fundamental technique for similarity search and similarity estimation in high-dimensional spaces. The basic idea is that similar objects should produce hash collisions with probability significantly larger than objects with low similarity. We consider LSH for objects that can be represented as point sets in either one or two dimensions. To make the point se...
RELIABILITY–BASED DESIGN OPTIMIZATION OF CONCRETE GRAVITY DAMS USING SUBSET SIMULATION
The paper deals with the reliability-based design optimization (RBDO) of concrete gravity dams subjected to earthquake load using subset simulation. The optimization problem is formulated such that the optimal shape of a concrete gravity dam, described by a number of variables, is found by minimizing the total cost of the dam for the given target reliability. In order to achieve this p...